Defining the Checkpoint Interval for Uncoordinated Checkpointing Protocols
Abstract
Parallel applications running on large computers suffer from the absence of a reliable environment. Fault tolerance proposals, in general, rely on rollback-recovery strategies supported by checkpointing and/or message logging. There are well-defined models that address the optimum checkpoint interval for coordinated checkpointing. Nevertheless, there is a lack of models concerning uncoordinated checkpointing combined with message logging. First, we present a model designed for serial applications or coordinated checkpointing-based solutions. Our contribution is the extension of this model to a scenario based on uncoordinated checkpointing combined with message logging. We introduce two key points to minimise the fault tolerance overhead for parallel applications. The first is the use of a factor to represent the dependency relation between processes. The second is the use of a specific checkpoint interval for each process. Experiments show that our model performs as well as previous studies for serial applications or coordinated checkpointing. When running parallel applications using uncoordinated checkpointing combined with message logging, our checkpoint interval model effectively minimises the overhead introduced by the fault tolerance tasks. Moreover, the overhead prediction error is smaller than 5% for all applications tested.
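As background on what an "optimum checkpoint interval" model looks like, the sketch below uses Young's classic first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF), which is one of the well-known models for serial or coordinated checkpointing. It is not the model proposed in this paper: the abstract does not give its formula, and the dependency factor between processes is deliberately left out. The function name, the rank labels, and all numeric values are hypothetical, and applying the formula separately to each process is only an assumption made to illustrate the idea of per-process intervals.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval.

    checkpoint_cost_s: time needed to write one checkpoint, in seconds.
    mtbf_s: mean time between failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical per-process checkpoint costs (seconds); not taken from the paper.
per_process_cost_s = {"rank0": 30.0, "rank1": 120.0}
mtbf_s = 24 * 3600.0  # assumed MTBF of one day

# Assumption: each process gets its own interval from its own checkpoint cost.
# The paper's dependency factor between processes is NOT modelled here.
intervals = {
    rank: young_interval(cost, mtbf_s)
    for rank, cost in per_process_cost_s.items()
}

for rank, interval in intervals.items():
    print(f"{rank}: checkpoint every {interval:.0f} s")
```

Under this approximation, a process whose checkpoints are four times as expensive checkpoints half as often, which hints at why a single global interval can be wasteful when processes differ.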
Similar resources
On the Calculation of the Checkpoint Interval in Run-Time for Parallel Applications
The growth in the number of components that compose parallel computers increases their fault frequency. Currently, in such systems faults are no longer a rare event but a common problem, thus some sort of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints. A common question surrounding checkpointing is the definition of the checkpoint interval. Checkpo...
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
Unified model for assessing checkpointing protocols at extreme-scale
In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the...
Compiler Supported Interval Optimisation for Communication Induced Checkpointing
There exist mainly three different approaches of checkpoint-based recovery mechanisms for distributed systems: coordinated checkpointing, uncoordinated checkpointing and communication induced checkpointing. It can be shown that communication induced checkpointing theoretically has the least minimum overhead, but also that the effective overhead depends on the communication behaviour and the res...
Coordinated Checkpoint versus Message Log for Fault Tolerant MPI
MPI is one of the most adopted programming models for Large Clusters and Grid deployments. However, these systems often suffer from network or node failures. This raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent ones are based on either coordinated checkpointing or message logging associated with uncoordinated checkpoint. There are many protocols, imple...